GitHub Repository: debakarr/machinelearning
Path: blob/master/Part 2 - Regression/Multiple Linear Regression/[R] Multiple Linear Regression.ipynb
Kernel: R

Multiple Linear Regression

library("IRdisplay")
display_png(file="img/01.png")
Image in a Jupyter notebook
display_png(file="img/02.png")
Image in a Jupyter notebook

Green = Dependent variable

Blue = Independent variable

display_png(file="img/03.png")
Image in a Jupyter notebook
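For reference, the model illustrated above can be written as one equation: the dependent variable is a linear combination of the independent variables plus a constant term,

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

where $y$ is the dependent variable (Profit in this dataset), $x_1, \dots, x_n$ are the independent variables, and $b_0, \dots, b_n$ are the coefficients that `lm()` estimates.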

Data Preprocessing

# Importing the dataset
dataset = read.csv('50_Startups.csv')
dataset

Encoding categorical data

dataset$State = factor(dataset$State, levels = c('New York', 'California', 'Florida'), labels = c(1, 2, 3))
dataset
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Profit, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)
training_set
test_set

Fitting Multiple Linear Regression to the training set

# regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
#                data = training_set)
regressor = lm(formula = Profit ~ ., data = training_set)
summary(regressor)
Call:
lm(formula = Profit ~ ., data = training_set)

Residuals:
   Min     1Q Median     3Q    Max 
-33128  -4865      5   6098  18065 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.965e+04  7.637e+03   6.501 1.94e-07 ***
R.D.Spend        7.986e-01  5.604e-02  14.251 6.70e-16 ***
Administration  -2.942e-02  5.828e-02  -0.505    0.617    
Marketing.Spend  3.268e-02  2.127e-02   1.537    0.134    
State2           1.213e+02  3.751e+03   0.032    0.974    
State3           2.376e+02  4.127e+03   0.058    0.954    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9908 on 34 degrees of freedom
Multiple R-squared:  0.9499,	Adjusted R-squared:  0.9425 
F-statistic:   129 on 5 and 34 DF,  p-value: < 2.2e-16
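The p-values printed above can also be read straight from the fitted model (a small aside, not in the original notebook), which makes the backward elimination steps later easier to follow:

# Extract the coefficient table from the fitted model and sort the p-values
# so the least significant predictor is easy to spot
coefs = summary(regressor)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
p_values = coefs[, "Pr(>|t|)"]
sort(p_values, decreasing = TRUE)         # highest p-value first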

Predicting the Test set result

y_pred = predict(regressor, newdata = test_set)
y_pred # Predicted value
test_set$Profit # Real value
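To see how the full model does on unseen data, predicted and real profits can be put side by side and summarised with a simple error measure (this comparison is illustrative and not in the original notebook):

# Side-by-side comparison of predicted and real profits, plus the test-set RMSE
comparison = data.frame(Predicted = y_pred, Actual = test_set$Profit)
comparison
rmse = sqrt(mean((test_set$Profit - y_pred)^2))
rmse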

Fitting Multiple Linear Regression to the training set using only R & D Spend

regressor = lm(formula = Profit ~ R.D.Spend, data = training_set)
y_pred = predict(regressor, newdata = test_set)
y_pred # Predicted value
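For comparison with the full model, the same test-set error measure can be computed for this single-predictor model (illustrative, not in the original notebook):

# Test-set RMSE of the R.D.Spend-only model
sqrt(mean((test_set$Profit - y_pred)^2))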

Building the optimal model using Backward Elimination

regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
               data = training_set)
summary(regressor)
Call:
lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + 
    State, data = training_set)

Residuals:
   Min     1Q Median     3Q    Max 
-33128  -4865      5   6098  18065 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.965e+04  7.637e+03   6.501 1.94e-07 ***
R.D.Spend        7.986e-01  5.604e-02  14.251 6.70e-16 ***
Administration  -2.942e-02  5.828e-02  -0.505    0.617    
Marketing.Spend  3.268e-02  2.127e-02   1.537    0.134    
State2           1.213e+02  3.751e+03   0.032    0.974    
State3           2.376e+02  4.127e+03   0.058    0.954    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9908 on 34 degrees of freedom
Multiple R-squared:  0.9499,	Adjusted R-squared:  0.9425 
F-statistic:   129 on 5 and 34 DF,  p-value: < 2.2e-16
# Remove State as its p-value is the highest and greater than 0.05
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
               data = training_set)
summary(regressor)
Call:
lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend, 
    data = training_set)

Residuals:
   Min     1Q Median     3Q    Max 
-33117  -4858    -36   6020  17957 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.970e+04  7.120e+03   6.980 3.48e-08 ***
R.D.Spend        7.983e-01  5.356e-02  14.905  < 2e-16 ***
Administration  -2.895e-02  5.603e-02  -0.517    0.609    
Marketing.Spend  3.283e-02  1.987e-02   1.652    0.107    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9629 on 36 degrees of freedom
Multiple R-squared:  0.9499,	Adjusted R-squared:  0.9457 
F-statistic: 227.6 on 3 and 36 DF,  p-value: < 2.2e-16
# Remove Administration as its p-value is the highest and greater than 0.05
regressor = lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
               data = training_set)
summary(regressor)
Call:
lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = training_set)

Residuals:
   Min     1Q Median     3Q    Max 
-33294  -4763   -354   6351  17693 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     4.638e+04  3.019e+03  15.364   <2e-16 ***
R.D.Spend       7.879e-01  4.916e-02  16.026   <2e-16 ***
Marketing.Spend 3.538e-02  1.905e-02   1.857   0.0713 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9533 on 37 degrees of freedom
Multiple R-squared:  0.9495,	Adjusted R-squared:  0.9468 
F-statistic: 348.1 on 2 and 37 DF,  p-value: < 2.2e-16
# Remove Marketing.Spend as its p-value is the highest and greater than 0.05
regressor = lm(formula = Profit ~ R.D.Spend,
               data = training_set)
summary(regressor)
Call:
lm(formula = Profit ~ R.D.Spend, data = training_set)

Residuals:
   Min     1Q Median     3Q    Max 
-34334  -4894   -340   6752  17147 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.902e+04  2.748e+03   17.84   <2e-16 ***
R.D.Spend   8.563e-01  3.357e-02   25.51   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9836 on 38 degrees of freedom
Multiple R-squared:  0.9448,	Adjusted R-squared:  0.9434 
F-statistic: 650.8 on 1 and 38 DF,  p-value: < 2.2e-16
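The elimination above was done by hand, one predictor at a time. One possible way to automate it is sketched below; the backwardElimination helper and the 0.05 significance threshold are illustrative choices, not part of the original notebook:

# Illustrative helper: refit the model and drop the least significant predictor
# until every remaining p-value is at or below the chosen significance level
backwardElimination = function(data, sl = 0.05) {
  regressor = lm(Profit ~ ., data = data)
  repeat {
    coefs = summary(regressor)$coefficients
    predictors = rownames(coefs)[-1]          # coefficient names, intercept excluded
    p_values = coefs[-1, "Pr(>|t|)"]
    if (max(p_values) <= sl) break
    worst = predictors[which.max(p_values)]
    # For a factor such as State the coefficient name (e.g. "State2") differs from
    # the column name, so strip trailing digits before dropping the column
    worst_col = sub("[0-9]+$", "", worst)
    data = data[, names(data) != worst_col, drop = FALSE]
    regressor = lm(Profit ~ ., data = data)
  }
  regressor
}
summary(backwardElimination(training_set, sl = 0.05))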